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ABSTRACT 

The availability of vector processors capable of sustaining computing 
rates of 10 8 arithmetic results per second has raised the question of whether 
peripheral storage devices representing current technology can keep such 
processors supplied with data. By carefully examining the solution of a large 
banded linear system on these computers it is found that even under ideal con- 
ditions the processors will frequently be waiting for problem data. 
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SYSTEM BALANCE ANALYSIS FOR VECTOR COMPUTERS 


1. INTRODUCTION 

The availability of vector processors such as the Control Data Corporation 
STAR-100 and Texas Instruments, Inc. Advanced Scientific Computer (ASC) provides 
the scientific community with considerably expanded computing power. These 
processors are able to approach sustained rates of 10® arithmetic results per 
second and this raises the question of whether available peripheral storage 
devices can keep such processors supplied with problem data. It is frequently 
true that even with a third generation scalar system, central processor 
efficiency is limited by the performance of the associated peripheral equip- 
ment. However, improved peripheral hardware and improved operating system 
performance have made this problem manageable. 

A study by Lynch [1974] using a simple model of matrix multiplication as a 
representative problem has indicated that conventional rotating bulk storage 
probably cannot supply data to vector processors fast enough to keep them from 
being idle most of the time. In the present paper a detailed analysis of the 
input/output (I/O) requirements for a problem of interest to the scientific 
community, that of solving a banded linear system, is given for an algorithm 
specifically designed for vector computers. The results obtained show that 
even under ideal conditions, guaranteeing an adequate supply of data for the 
processor is difficult with present technology. 

In the following section the central processor of a vector computer is 
described and a model for rotating bulk storage is developed. In Section 3 
the particular application problem including the solution algorithm is dis- 
cussed. Section 4 includes a timing analysis for the two vector computers 
mentioned above and Section 5 contains several implications of this study. 


2. HARDWARE .. .. 

2.1 Central Processor 

For the purposes of this paper, it is assumed that the arithmetic section 
of a vector computer contains one or more segmented execution units or 
"pipelines" which are used for vector arithmetic. A pipeline is functionally 
divided into a sequence of subunits or segments, each of which performs only 
part of a given arithmetic operation. A vector instruction initiates the 
flow of data from one or two source vectors to the pipeline. Assuming that 
the instruction involves two source vectors, each subunit accepts two elements, 
performs its particular function, passes the result, and possibly tf e source 
operands, to the next subunit, and receives the next two elements from the 
stream of vector operands. Thus, because of simultaneous execution in the 
subunits, a set of identical operations may be executed^ very quickly even 
though the time for a particular single result is much larger. 

The execution time associated with any vector instruction is composed 
of two parts. The first depends on the hardware status and is assumed here 
to be constant; this part is usually known as the "start up" time. The second 
part is variable and is a function of the length of the vectors involved. 

The start up time is the delay which occurs between the issue of the instruc- 
tion and the appearance of the first result.' Factors affecting this delay 
include the need to read operands from storage before arithmetic can begin, the 
fact that the first operands must proceed through all the segments before the 
first result appears and hardware housekeeping. The variable component of the 
execution time consists of a constant "per result" time multiplied by the 
number of elements in the vector. 

In general vector instruction execution times may be expressed as: 

E = S + PL 
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where E = total time, S = start-up time, P = time per result and L = 
vector length. Since most pipelines operate synchronously, it is convenient 
to express times as multiples of the machine's basic processor cycle. In this 
paper it will be sufficient to consider only the operations of addition (compo- 
nentwise), multiplication (componentwise), transmit (move a vector within main 
storage), and inner product. 

2.2 Peripheral Storage 

In this analysis it is assumed that the bulk storage device is managed by 
a sophisticated satellite processor which operates independently from the cen- 
tral processor. In addition, in the interest of efficiency, it is further as- 
sumed that the user has complete control of the data layout on the storage device. 

For the problem to be described here, and in many large scientific compu- 
ting problems, the data bases in use are only referenced sequentially. This 
property is frequently used to improve the overall efficiency of the data 
transfer operations. In the model of a peripheral device which follows, the 
use of only sequential access allows some quantities to be treated as constants 
rather than as random variables. Also, the storage device being modeled is 
patterned after a disk although adjustment of the parameters allows the modeling 
of several devices. For example, setting the parameter corresponding to seek 
time equal to zero provides a reasonable model of a drum. Head switching time 
is not explicitly considered because it can be regarded as part of the rotational 
delay. 

The following parameters are used: 

D g The maximum seek delay when reading a large sequential file. 

This will correspond to the time required to move a disk head 
between adjacent cylinders. 

D d The disk rotation time. 
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Y That proportion of the rotation time which actually constitutes 

a delay when switching tracks. Careful layout of data or 
judicious hardware design can cause y to be very small. 

R The peak transfer rate which can be attained during the transfer 

of a single block of data. 

6 That proportion of the transfer rate which can be sustained 

during the transfer of a complete track. 

M The number of bits per track. 

C The number of tracks per cylinder. 

Assuming that a data set consists of B bits which are to be read or 
written sequentially, and that the disk arm is correctly positioned as the 
transfer operation begins, the time T required to transfer this data to or 
from a single device as modeled is approximately: 

(2.2.1) T = ( [~Bl -1)D time required to move head between 

| MC | ■ cylinders 

+ ( fB 
I M 

+ JL transfer time. 

R<$ 

For any given disk drive, the parameters y and 6 are extremely diffi- 
cult to quantify. They are greatly affected by the efficiency of the controlling 
software and many other factors. For the purposes of this paper it is assumed 
that the entire data file will be used sequentially by the algorithm. It is 
further assumed that the independent processor controlling the disk has suffi- 
cient capabilities to store and reorder, if necessary, a track of information. 
Consequently, in principle, reading a track load of data may begin anywhere on 


' 


rotational delay 
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the track; in practice reading will begin at the next available sector. Thus 
the maximum rotational delay is the time for one sector to pass under the read 
head. The estimate of y used here is one half of the sector size as a frac- 
tion of the track size. 

It is virtually impossible to give a reasonable estimate of 5. In this 
analysis it is assumed that 6=1 although in practice it will always be less. 

3. THE APPLICATION PROBLEM 
3.1 Choice of Problem 

In choosing a particular problem to investigate, two goals were kept 
in mind. One was to make a selection which was representative of actual 
problems to be solved on vector computers and which might be common to many 
different applications. The second goal was to choose a problem for which 
the computation and I/O demands were simple enough to permit a meaningful 
analysis. 

The problem that was chosen is one which is common to the many areas of 
science and engineering that require the numerical solution of partial 
differential equations; namely, solving a system of linear equations, 

Ax = b. The restrictions placed on the NxN matrix A are that it is real, 
symmetric, positive definite and banded (i.e., all nonzeros are clustered 
near the main diagonal). Because of the interest in the possible inefficiencies 
caused by I/O for problems too large for main memory, the order N of the 
matrix is assumed to be quite large, perhaps in the range of 5,000 to 50,000: 
-The bandwidth, 0 (see Figure 1), is usually some fractional power of N; for 
example, 6 is approximately /T for the problem of solving the Poisson 
equation on a square region. For a set of matrices arising in various practical 
problems, it was noted by Gibbs, Poole and Stockmeyer [1974] that the bandwidth 
averaged about 3 /IT. These values of 0, /¥ and 3 /"7f, are assumed to be 
representative for the purpose of this study. 
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SYMMETRIC 



The Matrix A 
FIGURE 1 


3.2 Algorithm and Timing 

For the solution of the linear system, Ax = b, a modification of 
Gaussian elimination is used. The method of solution can be described 
mathematically as follow?*: 

T r * 

(a) factor A - LDL where L is unit lower triangular and D is 
diagonal with positive diagonal elements; 

(b) solve Lz = b; 

(c) solve Dy = z; 

<d) solve L T x=y. 

On a serial computer, the factors can be found by any of several algorithms, , 
see, for example, Martin and Wilkinson [1965], Some of these have been analyzed 
for vector computers by Lambiotte [1975], One of the fastest algorithms for 
problems of the size of interest here is a vectorized version of symmetric Gaussian 
elimination. For this algorithm it is assumed that the lower half of A is stored 
by columns in a (0+1) by N array and will be overwritten by L. Thus the 
matrix in Figure 1 will be stored as: 



N 


There are N major stages of the algorithm, one for each column of 
the matrix. A flowchart combining the i^ stages of parts (a) and (b) is 
given in Figure 2. 


Set a component of vector D to 1/a^: 


Create a vector T of multipliers: 

(tj, . . . ,t p )*- d i*( a j+ij * • • • * a i+3,i ) 


Zero the elements (a i+1 j . , . . . ,a^ M ) in 

column i of A and update column i+j of A: 
Repeat for j from 1 to 

( a i+j ,i+j ’ ‘ ’ a i+3,i+j )<s “ ( a i+j ,i+j * “ ' a i+3,i+j } 

“ ' a i+j,i*^ t j’* , ‘ ,t 3^ 


Create column i of L = {&/•) stored ■ i.n. column i of A: 

* 1 J 

( a 1+1 >i .... .a i+g>i ) <? (t 1 1 . . . ,t p j 


Solve Lz = b: 

^1 t 3^ <? ^i*^i+l,i A i+3,i ^ 


Produce z^ +1 stored in b. +1 : 


(b i+1 ,. ..,b i+0 )< ( b j+i»--** b i + 3) “ 

(tj , • . . ,t^) 
6 


Flowchart of i L " Stage 


FIGURE 2 
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Parts (c) and (d) can be implemented in a straight forward manner. For 
further detail the reader should see Appendix A which contains a program 
for the entire algorithm. 

The timing formulas that follow are applicable to any computer that has 
the characteristics described in Section 2.1. An outline of the derivation of 
the timing formulas is given in Appendix B. 

Formula (3.2.1) is for parts (a), (b) and (c) and corresponds to procedure 
BANSYG in Appendix A; formula (3.2.2) is for part (d) and procedure UPBSOL. 

They are given in units of machine cycles. The DA term refers to the time 
required to perform the scalar divide and for storage of - l's in steps such 
as (1) and (2) of procedure BANSYG. The start up times for addition, multipli- 
cation, transmit and inner product are designated by S A , $ M , Sy and Sj, respec- 
tively 1 , the per result times are denoted by P A , P^, Py, and Pj. 

■(3.2.1) T b (N,(3) = .5(P m +P a )NB 2 + (S M +S A +2.5P M +1.5P A +P T )NB 

+ (DA+2S M +S A +S T +P M )N - .333(P M +P A )B 3 - . 5(S M +S A +3P M +2P A -! Py) 3 2 
- .5(S^+S A +2.333P^+1.333P A +Py)3 - (S^+S A +Sy). 

(3.2.2) T y (N,3) = PjN3 + (Sj+Pj)N - .SPjB 2 - .5Pj3 - (Sj+Pj). 

3.3 Data Management 

None of parts (a), (b), (c) or (d) of the algorithm requires that all of 
the matrix A be in main memory at one time. By examining step 3 of the flow- 
chart in Figure 2, it is clear that for the ith stage of the factorization only 
columns i, i+l,...,i+3 are required. Furthermore, after step 6 of the 
flowchart is executed, column i is not needed again in parts (a) or (b). Thus 
while computation is being performed at the ith stage, the (i-l)st column may be 
removed from memory and the (i+3 + l)st column may be brought in (see Figure 3). 
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SYMMETRIC 



Data Flow for the Factorization 


FIGURE 3 


In summary, the requirements on main memory for parts (a), (b) and (c) 
consist of an array of size (0+1) 2 for the active columns of A; at least two • 
buffers of size 0+1: one for the column of A for which computation ha$ been 
completed and one for the column for which computation has not yet begun; two 
vectors of length N for D and b; and one of length 3 for the temporary 
vector needed. Thus the minimum main memory required for efficient operation is 

(3.3.1) (0+1) 2 + 2N + 30 + 2 words. 

It is assumed that the data is laid out on the peripheral storage device 

in the following manner. Starting with the first column of the matrix, columns 

» 

are stored sequentially on a track. Tracks are then filled in sequence. For 
a device organized as a set of concentric cylinders, after a track is filled, 
the next track on the same cylinder is used. Upon filling an entire cylinder, 
the adjacent cylinder is used. 

The algorithm effectively treats a column as the logical record size, 
although in practice one track of data, perhaps containing more than one column, 
would be transferred as a single physical record. This allows maximum device 
utilization. It is assumed that the satellite processor which manages the 
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peripheral storage device will reformat the tracks of data into the columns 
required by the algorithm. 

It should be pointed out that step 3 of the flow chart actually requires 
that only two columns of A and a temporary vector be in main memory at one 
time; however, this would increase the amount of data to be moved in one stage 
of the factorization to 3(3+1) words. The time required to move this amount 
of data is at least an order of magnitude larger than the corresponding compu- 
tation time. Furthermore, the algorithm used here will work if only the region 
above the dotted line in Figure 3 is in main memory; however, this makes it 
impractical to keep all of the elements of a given column in contiguous storage 
locations, a requirement for maximum efficiency of vector operations. 

Ordinarily the factorization and the lower solve are organized as separate 
modules because of their independence; this requires two complete passes through 
the data. However, in the flowchart, the lower solve is embedded in the 
factorization module so that only one pass is necessary. 

Part (d) of the algorithm requires another pass through the matrix with 
the columns required in reverse order, one at a time. It should be pointed out 
that under the assumptions of Section 2.2, this causes only minor problems for 
the satellite processor since data management can take place in its own storage. 
Thus the reverse order requirement is met by reading the tracks in reverse order 
and using a buffer to sort out the data taken from a given track. 

4. RESULTS FOR TYPICAL COMPUTERS 

The various assumptions made on hardware in Section 2 are satisfied by the 
CDC STAR-100 and the TI ASC. - The only assumption made about system software is 
that it does not interfere with the user's control of relevant hardware features 
as described in Sections 2 and 3. In particular there are no other active tasks 
competing for resources. 
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This section includes detailed timings of computation and I/O for the two 
computers. These are based on the procedures BANSYG and UPBSOL found in Appendix 
A. Procedure BANSYG corresponds to parts (a), (b) and (c) of the algorithm given 
in Section 3.2; procedure UPBSOL corresponds to part (d). For the sake of sim- 

4 

pl i city the procedure names are used throughout this section. 

4.1 CDC STAR- 100 

' The STAR-100 central processing unit (CPU) is equipped with two arithmetic 
pipelines although the user has no control over how they are utilized for a parti c 
ular instruction. Table 1 contains timing information from Control Data Corpo- 
ration [1974A] for the instructions pertinent to this study. The times are in 
CPU cycles and are for 64 bit operands. 

h - 94 p A ■ * 

= 156 = 1 

S T = 90 P T = h 

Sj » 100 Pj = 4 

STAR- 100 Instruction Timings 

(In CPU cycles of 40 nanoseconds) 

TABLE 1 

Based on scalar instruction timing, DA is estimated to be 106. If these • 
times are used in expressions (3.2.1) and (3.2.2), the following timing 
formulas are obtained for the procedures BANSYG and UPBSOL respectively: 

T B (N,3) = .75N3 2 + 253.75N3 + 603N - .53 3 - 127.253 2 - 126.753 - 340 

Tjj(N ,3 ) = 4N3 + 104N - 23 2 - 23 - 104 

Table 2 presents the time required for computation for the two procedures for 
various values cf N and 3 . It should be emphasized that the times are based 
on. manufacturer 's documentation rather than on actual computer timings. 
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B A N S Y 

G 

U 

P B S 0 

L 

Computation 

Input/Output 

Computation 

Input/Output 

N 

B 

844 

819 

844 

819 

5,000 

70 

4.38 

4.73 

0.70 

0.076 

4.73 

0.70 

5,000 

210 

16.03 

14.07 

2.09 

0.185 

14.07 

2.09 

10,000 

100 

13.32 

13.47 

2.01 

0.201 

13.47 

2.01 

10,000 

300 

56.69 

40.17 

5.99 

0.514 

40.17 

5.99 

25,000 

160 

60.19 

53.72 

8.02 

0.742 

53.72 

8.02 

25,000 

480 

291.82 

160.50 

23.96 

2.006 

160.50 

23.96 

50,000 

225 

190.84 

150.82 

22.52 

2.004 

150.82 

22.52 

50,000 

675 

1018.73 

451.13 

67.36 

5.571 

451.13 

67.36 


Computation and Input/Output Times (in seconds) for STAR-100 


TABLE 2 

Two disk drives are available for use with the STAR-100. They are 
CDC models 844 and 819, and for these drives the values of the parameters used 
in the model of section 2.2 are given in Table 3, as taken from Control Data 
Corporation [1974B]. 



844 

819 

°s 

10 ms. 

15 ms. 

°R 

33 ms. 

33 ms. 

R 

6.8xl0 6 bps. 

38xl0 6 bps. 

M 

9.8xl0 4 bits 

52.4xl0 4 bits 

C 

19 

10 


STAR-100 Disk Characteristics 

TABLE 3 
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An 844 disk track is divided into 3 sectors; thus y is taken to be 1/6. 

An 819 disk track consists of 16 sectors and y is taken to be 1/32. 

It is assumed that separate dedicated disks and associated controlling 
equipment are available for the input of the matrix A and for the output of 
the matrix L of the factorization. This permits the data to be transferred 
with minimal disk arm movement. The times required in seconds in terms of N 
and 3 are given by the following expressions obtained from formula (2.2.1). 
T 84 4 <N,$) = {f(3.44xlO-5)N(3+l)"| - 1)(10“ 2 ) 

+ { T(6. 53xlO -4 )N(3+1 )”] - 1) (5. 5xl0- 3 ) + (9.41xl<r 6 )N(8+l) 
T 8 1 9 (N,3) = ( f( 1. 22xlO”^)N(3+l)"| - 1 )( 1 . 5x10-2) 

+ ( f"(l.22xl0"4)N($+lf] - 1) (1.03xl0“3) + (1.68xlO" 6 )N(3+l) 
Table 2 contains the total I/O time for the two procedures. 

In Table 2, the I/O times for the 844 disk exceed the computation times for 
procedure BANSYG in some cases. Furthermore, these I/O time estimates may be 
overly optimistic because of the ideal conditions assumed about the secondary 
storage environment (e.g., 6=1 in (2.2.1)). One approach to handling this proba- 
ble 1/0-computation imbalance is multiprogramming. However, the limited amount 
of main memory available on the STAR-100 (either 524K or 1048K words) may make 
multiprogramming impractical: for most of the cases shown in Table 2, the main 
memory requirement of procedure BANSYG is a significant portion of the smaller 
memory. For the larger memory, multiprogramming is more feasible for handling 
several of the smaller problems. 

If 819 disk drives are used, the question of multiprogramming is not as 
important for procedure BANSYG since the columns of the matrix can be read or 
written at a rate which exceeds the computation by a reasonable margin. Care- 
ful buffering will permit the I/O operations to totally overlap the processing. 
However, it must be noted that separate drives may still be needed for each of 
the two data streams in order to avoid disk arm movement. 
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The main storage requirement of procedure UPBSOL consists of just one 
row of the matrix and one vector, N + B + 1 words. Thus this phase uses rela- 
tively little processor time or main memory but the data input requirements are 
the same as for procedure BANSYG. The entire matrix must be read sequentially 
into main storage with the rows in reverse order as the processing proceeds. 

Even the 819 disk unit is too slow by an order of magnitude on this procedure. 

If multiprogramming involving processes of this type is to be used to increase 
processor utilization then sufficiently many independent high performance disk 
drives must be available to permit each process to have dedicated drives. 

4.2 TI ASC 

The central processor of the ASC is available with up to four pipelines and 
in this analysis a four pipeline configuration is assumed. In addition the way 
in which the pipelines are used on a given sequence of instructions is deter- 
mined by the programmer (or compiler). A single vector instruction may be exe- 
cuted by a single pipeline, or the operand vectors may be split and shared 
between the pipelines. In this analysis, it is assumed that all vector instruc- 
tions are shared. With this assumption, the timing of vector instructions in 
terms of CPU cycles for 64 bit operands is shown in Table 4. This data is from 
Texas Instruments, Inc. [1973AJ. 


S A 

= 109 

P A = 

7/16 

S M 

= no 

V 

3/4 

S T 

= 109 

P T 

7/16 

S I 

= 120 

V 

1 


ASC Instruction 

Timings 


(In 

CPU cycles of 60 nanoseconds) 


TABLE 4 
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Based on scalar instruction timing, DA is estimated to be 27. If 
these times are used in expressions (3.2.1) and (3.2.2), then the follow- 
ing timing formulas are obtained for the procedures BANSYG and UPBSOL 
respectively. 

T b (N,B) = .594N3 2 + 221.969N3 + 465. 75N - . 39603 
.- 111.2813 2 - 110.8853 - 328 
T y (N , 6 ) = m + 121N - .53 2 - .53 - 121 

Table 5 presents the time required for computation for the two procedures for 
various values of N and 3. Again, the times are not based on actual computer 
timings. It should be noted that the same algorithm is used for the ASC as for 
the STAR. Because the relative speeds of instructions may vary from computer 
to computer, the algorithm used here may not be the fastest for the ASC. Con- 
sequently, nr conclusions regarding the relative speeds of the two computers 
should be drawn from the data of Tables 2 and 5. 


N 

3 

BANSYG 

U 

P B S 0 

L 

Computation 

Input/Output 

Computation 

Input/ Output 

PAD 

HPT 

PAD 

HPT 

■H i 

70 

5.63 

4.73 

1.52 

0.057 

4.73 

1.52 

5,000 

210 

21.46 

14.07 

4.51 

0.098 

14.07 

4.51 



17.07 

13.47 

4.32 

0.132 

13.47 

4.32 



71.05 

40.17 

12.87 

0.250 

40.17 

12.87 

25,000 

160 

76.50 

53.72 

17.21 

0.421 

53.72 

17.21 

25,000 

480 

361.55 

160.50 

51.40 

0.895 

160.50 

51.40 

50,000 

225 

240.79 

150.82 

48.31 

1.036 

150.82 

48.31 

50,000 

675 

1252.12 

451.13 

144.49 

2.374 

— 

451.13 

144.49 


Computation and Input/Output times (in seconds) for ASC 
... . . TABLE 5 








The two disk drives available with the ASC are the Head Per Track Disk (HPT) 
and the Positioning Arm Disk (PAD). They have the following parameter values (see 
Texas Instruments, Inc. [ 1973B ] ) : 

PAD HPT 

D s 10 ms. 0 

Dp t 33 ms. 34 ms. 

R 6.8xl0 6 bps. 15xl0 6 bps. 

M 9.8xl0 4 bits 5.2xl0 5 bits 

C 19 NA 

ASC Disk Characteristics 
TABLE 6 

Each track of the PAD disk is divided into 3 sectors; thus y is taken to 
be 1/6. An HPT disk track is divided into 256 sectors and y is taken to be 
1/512, Note that D s is zero for the HPT disk so that the first term of expres- 
sion (2.2.1) vanishes. 

If it is assumed that the data is transferred in a continuous stream with 
two independent dedicated disk units, then the time required in terms of N and 
3 is given by the following expressions for the PAD and HPT disks respectively. 

T P AD<M) = ([(3.44xlO" 5 )N(3+l)1 - 1)(10" 2 ) 

+ (["(6.53xlO“ 4 )N(3+l)] - 1)(5. 5x10-3) 

+ (9.41xlO*" 6 )N(B+l) 

T hpt (N,B) = (f(1.23xlO“ 4 )N(8+l)1 -1)(6. 64x10-5) 

+ (4.27x10“ 6 )N(8+1) 

Table 5 contains the total I/O times for the two procedures. 

The overall results are similar to those obtained for the STAR-100. A 
major difference however is the availability of much larger main storage sizes 
for the ASC. It can be delivered with up to 8 million 64 bit words which exceeds 
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the maximum STAR-100 memory by a factor of 8. This has two important implica- 
tions: first, multiprogramming of large scale scientific programs is much more 
feasible; and secondly, for the smaller values of N and 0, the entire factored 
matrix can be held in main storage. The latter alternative eliminates the need 
for the output stream for procedure BANSYG and the input stream for procedure 
UPBSOL. Because it is procedure UPBSOL which is severely limited by its input 
requirements, if the entire matrix is available in storage that phase can proceed 
uninterrupted. 

5. CONCLUSIONS 

Achieving adequate input/output rates is a significant problem with high 
performance vector computers. The analysis presented above indicates that 
currently available disk units are able to provide data at a sufficient rate 
only in cases when a sizable amount of computation is performed with each item 
of data. For instance, in procedure BANSYG 0(3) computations are performed with 
each element of the matrix, whereas there are only 2 computations per element in 
procedure UPBSOL. Iri the general mix of jobs at a scientific computing facility, 
one might encounter many problems whose order of computation is similar to that 
of UPBSOL. 

An approach to handling the I/O imbalance for third generation computers is 
multiprogramming. It was indicated in Section 4 that multiprogramming the 
STAR-100 is not always practical for the problems considered here, whereas it 
may be fruitful for these problems on the ASC because of the availability of a 

i 

much larger main memory. 

Another possible solution is the use of multiple disk drives for each 
data stream. Since only sequential files need be considered here, the necessary 
synchronization is possible in principle. A longer term solution may be offered 
by memories using the newer technologies such as magnetic bubbles. 

Perhaps an even greater problem than supplying the central processor with 
data is that of supplying the entire system. Once the computations have been 
completed for sets of data residing on disks, new data sets must be transferred 
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to the disk system from other peripherals. For example, ten of the above problems 
with N = 10,000 and g = 300 will fill a n 819 disk, yet all computation will 
be completed in approximately ten minutes. Loading new data sets onto an 819 
will take considerably longer than this. Very little attention seems to have 
been given to this question. 

Finally, since there is an excess of computing power and a lack of 1/0 

capabilities, it would behoove the designers of numerical algorithms to consider 

techniques which reduce 1/0 requirements at the, perhaps considerable, expense 

* 

of extra computation. For example, by examining Table 2, one sees that an 1/0 
decrease of 50% and a computation increase of fivefold would halve the total 
time required by procedure UPBS0L on the STAR- 100. Note that the 1/0 time 
requirements were reduced by 1/3 simply by organizing the factorization and 
the lower solve into one module, thus eliminating the need for a third pass 
through the matrix. 
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APPENDIX A. PROGRAM FOR ALGORITHM 


The algorithm is presented in a new programming language called 
SL/1 which is being developed explicitly for the CDC STAR-100. The details 
of SL/1 may be found in Baslli and Knight [1975); however, pertinent points 
are included here so that the reader may follow the algorithm. 

All declarations in SL/1 are explicit and are similar to those in ALGOL. 

Thus 

REAL VECTOR [2..BETA+1] T; 

is the declaration of a vector named T with BETA real components subscripted 
from 2 to BETA + 1. The lower subscript bound may be omitted in which case 
it defaults to one. 

The other declaration of interest in SL/1 is used to describe an array 
of vectors: 

REAL VECTOR [BETA+1) ARRAY (N) A; 

This declaration describes a data structure named A consisting of N vectors 
of real numbers each of length BETA + 1. SL/1 also permits the user to 
reference subvectors and subarrays; thus, for example 
A ( I ) [2..N-I+1] 

1L 

refers to elements 2 through N - I + 1 of the I tn vector of array A. 

The only nonstandard arithmetic operator of SL/1 used here is the inner 
product. Thus the statement: 

B[IJ := -U(I)[1..N-I+1) .DOT. B[I..N] 

stores the inner product of the negative of the first N - I + 1 components 
of the I** 1 vector of the array U with the I through N** 1 elements of the 

i,L 

vector B in the I in location of vector B. 

The program is divided into two procedures: BANSYG implements parts (a), 

(b), and (c) of Section 3.2 and UPBSOL corresponds to part (d). It should be 

pointed out that both procedures consist almost entirely of vector operations, 

and are thus extremely well suited for a vector processor. 
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PROCEDURE BANSYG; 

/♦THIS PROCEDURE PERFORMS AN LDL T FACTORIZATION OF THE GIVEN SYMMETRIC, 
POSITIVE DEFINITE, N BY N MATRIX A WITH (SEMI) BANDWIDTH BETA. IT ALSO 

SOLVES LZ = B AND DY =. Z. THUS TO COMPLETE THE SOLUTION OF AX - B, 

ONE MUST SOLVE L T X = Y. 

THE LOWER PART OF A IS STORED BY COLUMN, THE LOWER TRIANGLE, L, IS 

COMPUTED A COLUMN AT A TIME AND IS STORED OVER A. THE VECTOR D 

CONTAINS THE INVERSES OF THE DIAGONAL ELEMENTS OF THE MATRIX D. UPON 
EXIT, THE VECTOR B CONTAINS THE ELEMENTS OF Y, OVERWRITING THE RIGHT 
HAND SIDE OF THE ORIGINAL LINEAR SYSTEM.*/ 

REAL VECTOR [BETA+1] ARRAY (N) A; 

REAL VECTOR [N] B,D; 

REAL VECTOR [2.. BETA+1] T; 

INTEGER BETA.I ,J,N; 

/♦FIRST PERFORM THE ELIMINATION FOR COLUMNS 1 ,2 , • • • ,N-BETA~1 */ 

FOR I FROM 1 TO N - BETA - 1 DO 

(1) DII] := 1 - 0/A( I ) [1] ; 

(2) A(I)[1] := -1.0; /* REDEFINE DIAGONAL ELEMENTS 

FOR USE IN PROCEDURE UPBSOL */ 

(3) T := D[I]*A(I){2.. BETA+1]; /* CREATE THE MULTIPLIERS */ 

FOR J FROM 1 TO BETA DO /* MODIFY RELEVANT ELEMENTS */ 

(4) A(I+J) [1. .BETA-U+1] :*= A( I+U) [1. .BETA-J+1] 

- T[J+1..BETA+1]*A(I) 10+13; 

ENDF J; 

(5) A(I)[2. .BETA+1] := T; 

(6) T := B[I}*A(I) [2.. BETA+1]; /* NOW SOLVE LZ = B FOR Z[I+1] */ 
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(7) B[I+1. . I+BETA] := B[I+1. . I+BETA] - T; /* Z[I+1] IS IN B[I+1] */ 
ENDF I; 

f 

/* NEXT PERFORM THE ELIMINATION OF THE DENSE TRIANGULAR BETA+1 SYSTEM 
IN THE LOWER RIGHT CORNER */ 

FOR I FROM N - BETA TO N - 1 DO 

(8) D[I] := 1 . 0/A( I ) [ 1 ] ; 

(9) A( I ) [ 1 3 := “1.0; 

(10) T[2. .N-I+l] := D[I]*A(I)[2. .N-I+l]; 

FOR J FROM I + 1 to N DO 

(11) A(U)[1. .N-J+l] :- A(J) [1. .N-J+l] 

- T[J-I+1. .N-I+1]*A(I) [J-I+l] ; 

ENDF J; 

(12) , A(I)[2. .N-I+l] ;= T(2. .N-I+l] ; 

(13) T (2. . N-I+l ) := B[I]*A(I)[2. .N-I+l]; 

(14) B[I+1..N] : = B[I+1..N] - T[2. .N-I+l]; 

ENDF I; 

, (15) D[N) := 1 . 0/A(N) [ 1 3 ; 

(16) A(N) [1] := -1.0; 

/* FINALLY SOLVE DY = Z */ 


(17) B := D*B; 
ENDP BANSYG; 
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PROCEDURE UPBSOL; 

/* THIS PROCEDURE SOLVES THE UNIT UPPER TRIANGULAR SYSTEM UX=B. 

STORAGE IS SIMILAR TO BANSYG. THE RESULT, X, IS OVERWRITTEN ON B. */ 

REAL VECTOR [BETA+1] ARRAY (N) U; 

REAL VECTOR [N] B; 

INTEGER BETA, I, N; 

/* FIRST SOLVE DENSE LOWER RIGHT CORNER TRIANGLE */ 

FOR I FROM N - 1 TO N - BETA + 1 DO 
(1) B[I] := -U ( I ) [ 1 . .N-I+l] .DOT. B[I..N]; /*’U(I)[1] CONTAINS -1.0 */ 

ENDF I ; 

/* NEXT SOLVE REMAINING BAND */ 

FOR I FROM N - BETA TO 1 DO 
(2} B[I] := -U(I) .DOT. B[I. .BETA+I]; 

ENDF I ; 

ENDP UPBSOL; 
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APPENDIX B. TIMINGS OF PROCEDURES 


This appendix contains an outline of the derivation of the timing 
formulas (3.2.1) and (3.2.2). T.. represents the timing of statement i 
in procedure BANSYG. 

B.l Timing of BANSYG 
T x + T 2 = DA 

T 3 = S M + P M 3 

T 4 = S M + P M (0-j+l) + S A + P A (0-J+1) 

T 5 “ S T + P T 3 
T 6 = S M + P M 3 
T 7 = S A + P A 3 
T 8 + T 9 = DA 

T 10^V 

T 11 = S M + P M (N - j+1) + S A + P A^-j +1 ) 

T 12 = S T + p T (N>i) 

T 13 = S M + P M^ N_1 ^ 


T 14 = S A + p A<"-*> 


T 15 + T 16 = DA 
T 17 = S M + P M N 
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* 


Thus the total computation time required by BANSYG is 


T b (U,3) = 


N-6-1 

I V 

T 2 + T 3 

*(i T 0 

1=1 


J=1 

N-l 

rv 

T 9 + T 10 

/ N 

HZ T n 

i=N~3 


3=1+1 

15 + T 16 

+ T 17‘ * 



After considerable manipulation, it can be shown that 

T b (M) = . 5(P m +P a )N 3 2 + (S M +S A +2.5P M +1.5P A +P T )Nf3 
+ (DA+2S M +S A +S. T +P M )N - ;333 (P m +P a )3 3 
- .5(S m +S a +3P m +2P a +P t )3 2 
-..5(S M +S A +2.333P M +1.333P A +P T )f3 - (S M +S A +S T ). 

B.2 Timing of UPBSOL 



'N-3+1 


" 1 

Ty(N ,3) = 

T Sj + Pj(N-1+l) 

+ 

Y s i + p i (e + D 


_i=N~l 


_i=N-3 


= PjNB + (Sj+Pj)N - .5PjB 2 = .5Pj0 - (Sj+Pj), 
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